1 Velvet command line parameters

We assembled the libraries generated by NxTrim using Velvet (version 1.2.10) with the following
commands:

velveth output_dir ${k} -short -fastq.gz ${prefix}.se.fastq.gz \
    -shortPaired2 -fastq.gz ${prefix}.pe.fastq.gz \
    -shortPaired3 -fastq.gz ${prefix}.mp.fastq.gz \
    -shortPaired4 -fastq.gz ${prefix}.unknown.fastq.gz

velvetg output_dir -exp_cov auto -cov_cutoff auto -shortMatePaired4 yes 

where ${k} is the k-mer size and ${prefix} is simply the sample name. We performed assemblies
across a range of k-mer sizes (21 to 119), choosing the assembly with the largest contig N50 for each
sample. The -shortMatePaired4 yes argument flags the unknown library as possibly containing
paired-end contaminants and was found to improve assembly quality.
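
The k-mer sweep and the selection rule can be sketched in Python as follows. This is a minimal illustration, not the script used for the paper: the helper names `velvet_commands`, `contig_n50` and `best_k` are hypothetical, the command strings simply mirror the velveth/velvetg invocations above, and N50 is computed from a plain list of contig lengths using the usual definition.

```python
def velvet_commands(prefix, k, output_dir="output_dir"):
    """Build the velveth/velvetg command lines shown above for one k-mer size."""
    velveth = (
        f"velveth {output_dir} {k} "
        f"-short -fastq.gz {prefix}.se.fastq.gz "
        f"-shortPaired2 -fastq.gz {prefix}.pe.fastq.gz "
        f"-shortPaired3 -fastq.gz {prefix}.mp.fastq.gz "
        f"-shortPaired4 -fastq.gz {prefix}.unknown.fastq.gz"
    )
    velvetg = (
        f"velvetg {output_dir} -exp_cov auto -cov_cutoff auto "
        "-shortMatePaired4 yes"
    )
    return velveth, velvetg


def contig_n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def best_k(n50_by_k):
    """Pick the k-mer size whose assembly has the largest contig N50."""
    return max(n50_by_k, key=n50_by_k.get)
```

Given a dictionary mapping each k to the contig N50 of its assembly, `best_k` returns the k that would be kept for that sample.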

For the MiSeq Reporter trimmed reads, we only have one library. We used similar commands: 

velveth output_dir ${k} -shortPaired2 -fastq.gz ${prefix}.fastq.gz
velvetg output_dir -exp_cov auto -cov_cutoff auto -shortMatePaired2 yes 

Again, we searched over the same range of k-mers, taking the assembly with the largest contig N50 for each sample.

These commands were inspired by Dr. Torsten Seemann's blog:
http://thegenomefactory.blogspot.co.uk/2012/09/using-velvet-with-mate-pair-sequences.html
2 Adapter trimming logic



We now outline the logic behind our adapter trimming and virtual library creation routine. We
describe the metric we use for adapter detection in section 2.1 and the library assignment of read pairs
in section 2.2.




[Figure panels: A. Standard mate pair orientation; C. Paired end orientation (reads can be joined);
D. Mate pair (or paired end) and a single read; E. Unknown orientation (mate pair);
F. Unknown orientation (paired end contaminant).]
Figure S1: Enumeration of the different trimming scenarios. The blue areas represent genomic DNA,
whilst the yellow area is the artificial Nextera Mate Pair adapter sequence, which needs to be removed.
Genomic DNA on opposite sides of the adapter comes from physically distant (≈4kb on average) genomic
locations with Reverse-Forward orientation. The arrows represent the sequence assayed by the reads
$R_1$ and $R_2$. The dashed boxes represent the sequence that will be kept after trimming the adapter
sequence from the reads. When present (and detected), the location of the adapter informs us about
the orientation and physical distance of the reads, allowing us to categorise them into virtual libraries.
Cases A and B are the most common, resulting in an obvious mate-pair (MP) or paired-end (PE) read
pair respectively. In case C no adapter was present, but the overlap between reads tells us that the reads
are forward-reverse orientated with a paired-end distance (and can optionally be joined into a longer
read). Case D allows us to produce either an MP or PE pair; we choose the pair with the longest read
lengths, and the overhang is stored as a single-ended read if it is long enough. In cases E and F we have
no information about the read orientation; typically these are MP reads (E), but there is a small amount
of PE contamination (F).
2.1 Adapter detection



We search for adapter sequence within a read pair $(R_1, R_2)$ containing $L_R$ bases in each read. The
Nextera Mate Pair adapter sequences are:

• $A_1$ = CTGTCTCTTATACACATCT

• $A_2$ = AGATGTGTATAAGAGACAG



The concatenation of these two strings, $A_1 + A_2$, is the yellow region in Figure S1 and it (or at least a
substring of it) is what will typically be observed in one or both of the reads in a pair. Due to
substitution (and sometimes indel) errors in sequencing, we need a detection routine that allows for
imperfect matches. Note that the adapter sequence is the reverse-complement of itself, so we do not need
to worry about strandedness. We describe the search for a single read $R$, as the detection routine is the
same for both reads.
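
The self-complementarity of the adapter is easy to verify. The short check below is an illustrative sketch (the helper `reverse_complement` is not part of the original routine); it confirms that $A_2$ is the reverse complement of $A_1$, and hence that the 38bp concatenation reads the same on both strands.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


A1 = "CTGTCTCTTATACACATCT"
A2 = "AGATGTGTATAAGAGACAG"
JUNCTION = A1 + A2  # the full 38bp junction adapter

# A2 is the reverse complement of A1 ...
assert reverse_complement(A1) == A2
# ... so the concatenated adapter is its own reverse complement.
assert reverse_complement(JUNCTION) == JUNCTION
```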

For both $A_1$ and $A_2$, we slide a 19-mer window across $R$ and compute the Hamming distance between
each window and the adapter. We account for partial matches to the adapter by allowing the window to
shrink to as few as 12 bases at the extremities of the read and comparing with the appropriate substring
of the adapter. If the smallest distance is below a threshold derived from a similarity measure $p = 0.85$,
we consider the adapter detected and return the indices $(a, b)$ of $R$ that contain it. Checking for each
adapter separately may seem redundant, but we do this for two reasons:

• Occasionally, the DNA fragments manage to circularise with only one adapter present. 

• Hamming distances will not allow us to detect adapter sequence containing indel errors. Since these
are rare in Illumina data, the chance of seeing more than one across the 38bp merged adapter is
extremely low. Checking each half separately allows the presence of one half (with an indel error)
to be inferred from the presence of the other (with no indel error).

Both of these outcomes are very rare, but the 19-mer adapter is long enough to provide high specificity 
so no accuracy is lost by only checking for one half of the adapter rather than the full 38bp length. 
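
To make the threshold concrete, the sketch below computes the largest Hamming distance that still counts as a match for a given window length. It assumes the strict inequality $d < (1 - p) \times L_c$ used in Algorithm 1; the helper name `max_mismatches` is hypothetical, and exact fractions are used only to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def max_mismatches(window_length, p="0.85"):
    """Largest integer d with d < (1 - p) * window_length (strict inequality)."""
    limit = (1 - Fraction(p)) * window_length
    # ceil(limit) - 1 is the largest integer strictly below the limit,
    # including the case where the limit itself is an integer.
    return math.ceil(limit) - 1
```

Under these assumptions a full 19-base window tolerates at most 2 mismatches (since 0.15 × 19 = 2.85), and the shortest 12-base window at the edge of a read tolerates at most 1 (since 0.15 × 12 = 1.8).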

The algorithms are described formally below; we assume the Hamming distance function is already
defined.
Algorithm 1 Adapter/overlap detection routines

function AlignString(R, A)
    MIN_d = L_A
    MIN_i = NA
    for i ∈ [−(L_A − L_M), L_R − (L_A − L_M)] do
        if i < 0 then
            d = Hamming(R[0, i + L_A], A[L_A + i, L_A])
            L_c = i + L_A
        else if i > L_R then
            d = Hamming(R[i, L_R], A[0, L_R − i])
            L_c = L_R − i
        else
            d = Hamming(R[i, i + L_A], A[0, L_A])
            L_c = L_A
        end if
        if d < ((1 − p) × L_c) and d < MIN_d then
            MIN_i = i
            MIN_d = d
        end if
    end for
    return MIN_i
end function

function DetectAdapter(R)
    a = AlignString(R, A_1)
    if a ≠ NA then
        return (a, a + 2 L_A)
    end if
    a = AlignString(R, A_2)
    if a ≠ NA then
        return (a − L_A, a + L_A)
    end if
    return (NA, NA)
end function

function Overlap(R_1, R_2)
    for i ∈ [0, L_R − L_M] do
        d = Hamming(R_1[i, L_R], R_2[0, L_R − i])
        if d < ((1 − p) × (L_R − i)) then
            return True
        end if
    end for
    return False
end function
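
The detection routines of Algorithm 1 can be sketched in Python as below. This is an illustrative reading rather than the NxTrim implementation: the function and parameter names are hypothetical, reads in a pair are assumed to have equal length, `None` stands in for NA, and the window bounds are derived from the prose description (the overlapping window never shrinks below $L_M = 12$ bases) rather than transcribed index-for-index from the pseudocode.

```python
from fractions import Fraction

A1 = "CTGTCTCTTATACACATCT"
A2 = "AGATGTGTATAAGAGACAG"
MISMATCH_RATE = 1 - Fraction("0.85")  # (1 - p), kept exact to avoid rounding
MIN_LEN = 12                          # L_M in the text


def hamming(x, y):
    """Number of mismatching positions between two equal-length strings."""
    assert len(x) == len(y)
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def align_string(read, adapter):
    """Offset i at which `adapter` best matches `read`, or None if no match.

    i < 0 means the adapter hangs off the start of the read;
    i + len(adapter) > len(read) means it hangs off the end.
    """
    l_r, l_a = len(read), len(adapter)
    best_i, best_d = None, l_a
    for i in range(-(l_a - MIN_LEN), l_r - MIN_LEN + 1):
        start, end = max(i, 0), min(i + l_a, l_r)  # part of the window inside the read
        window = read[start:end]
        target = adapter[start - i:end - i]        # matching part of the adapter
        d = hamming(window, target)
        if d < MISMATCH_RATE * len(window) and d < best_d:
            best_i, best_d = i, d
    return best_i


def detect_adapter(read):
    """(a, b) indices spanned by the full 38bp adapter, or (None, None)."""
    a = align_string(read, A1)
    if a is not None:
        return a, a + 2 * len(A1)
    a = align_string(read, A2)  # fall back to the second half
    if a is not None:
        return a - len(A1), a + len(A2)
    return None, None


def overlap(r1, r2):
    """True if a suffix of r1 matches a prefix of r2 over at least MIN_LEN bases."""
    l_r = len(r1)
    for i in range(0, l_r - MIN_LEN + 1):
        d = hamming(r1[i:], r2[:l_r - i])
        if d < MISMATCH_RATE * (l_r - i):
            return True
    return False
```

For a read consisting of 20 genomic bases followed by the intact adapter, `detect_adapter` returns the half-open interval starting at 20 and ending 38 bases later; as in Algorithm 2, `overlap` would be called with the second read reverse-complemented.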
2.2 Virtual library classification



After detecting the adapter (if present) in each read, we trim and categorise the read pairs. We define
$(a_1, b_1)$ and $(a_2, b_2)$ as the beginning ($a$) and end ($b$) indices of the adapter sequence for reads $R_1$ and
$R_2$ respectively. Note that we allow $a$ to be negative and $b$ to exceed the read length ($b > L_R$), since
the adapter may only be partially sequenced at the start (or end) of a read. We also define the variable
$L_M = 12$; reads shorter than $L_M$ after trimming are discarded.

We first try to detect whether the adapter is present in either (or both) reads. If no adapter is found, we
check whether the reads overlap. If the reads overlap, the read pair is PE (case C); otherwise it is
UNKNOWN (case E or F). If the adapter is found, we return an MP if the majority of genomic DNA is on
opposite sides of the adapter (case A); if most of the genomic DNA occurs on the same side of the adapter,
we return a PE (case B). If the adapter splits a read near the centre, we may also return a separate single
read in addition to the PE or MP pair (case D).
A formal description of this logic follows:



Algorithm 2 Adapter trimming and virtual library creation

function TrimReads(R_1, R_2)
    SE = NA
    PE = NA
    MP = NA
    UNKNOWN = NA
    (a_1, b_1) = DetectAdapter(R_1)
    (a_2, b_2) = DetectAdapter(R_2)
    if a_1 = NA and a_2 = NA then
        if Overlap(R_1, ReverseComplement(R_2)) then
            PE = (R_1, R_2)
            return (SE, PE, MP, UNKNOWN)
        end if
    end if
    if a_1 = NA and a_2 = NA then                           ▷ No adapters found
        UNKNOWN = (R_1, R_2)
    else if a_1 ≠ NA and a_2 < L_M then                     ▷ R_2 redundant
        SE = R_1[0, a_1]
    else if a_2 ≠ NA and a_1 < L_M then                     ▷ R_1 redundant
        SE = R_2[0, a_2]
    else if a_1 < L_R and a_2 < L_R and a_2 ≠ NA then       ▷ both reads have adapter
        MP = (R_1[0, a_1], R_2[0, a_2])
    else if a_1 < L_R and b_1 > L_R and a_2 = NA then       ▷ R_1 has adapter at end
        MP = (R_1[0, a_1], R_2)
    else if a_2 < L_R and b_2 > L_R and a_1 = NA then       ▷ R_2 has adapter at end
        MP = (R_1, R_2[0, a_2])
    else if b_1 < L_R and a_2 = NA then                     ▷ R_1 has an overhang
        (SE, PE, MP) = ResolveOverhang(R_1, R_2, a_1, b_1)
    else if b_2 < L_R and a_1 = NA then                     ▷ R_2 has an overhang
        (SE, PE, MP) = ResolveOverhang(R_2, R_1, a_2, b_2)
    end if
    MP = ReverseComplement(MP)                              ▷ Puts mate-pairs in Forward-Reverse orientation
    UNKNOWN = ReverseComplement(UNKNOWN)
    return (SE, PE, MP, UNKNOWN)
end function

function ResolveOverhang(R_1, R_2, a, b)
    SE = NA
    PE = NA
    MP = NA
    if a < (L_R − b) then                                   ▷ PE has bigger reads than MP
        PE = (R_1[b, L_R], R_2)
        if a > L_M then                                     ▷ Create a SE if the overhang is big enough
            SE = R_1[0, a]
        end if
    else                                                    ▷ MP has bigger reads than PE
        MP = (R_1[0, a], R_2)
        if (L_R − b) > L_M then                             ▷ Create a SE if the overhang is big enough
            SE = R_1[b, L_R]
        end if
    end if
    return (SE, PE, MP)
end function
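
The overhang case (case D) is the least obvious branch, so a small Python sketch of ResolveOverhang may help. It is illustrative only: the names are hypothetical, `None` stands in for NA, the adapter is assumed to occupy the half-open slice `r1[a:b]`, and the reverse-complement step that TrimReads applies to MP pairs afterwards is omitted.

```python
MIN_LEN = 12  # L_M in the text


def resolve_overhang(r1, r2, a, b, min_len=MIN_LEN):
    """Split r1 around an internal adapter at r1[a:b]; r2 carries no adapter.

    Returns (se, pe, mp); entries that are not produced are None.
    """
    se = pe = mp = None
    left = r1[:max(a, 0)]   # genomic DNA before the adapter (a may be negative)
    right = r1[b:]          # genomic DNA after the adapter
    if len(left) < len(right):
        # The longer piece follows the adapter: keep it as a paired-end pair.
        pe = (right, r2)
        if len(left) > min_len:
            se = left       # the shorter piece survives as a single read
    else:
        # The longer piece precedes the adapter: keep it as a mate pair.
        mp = (left, r2)
        if len(right) > min_len:
            se = right
    return se, pe, mp
```

For a 151bp read with the adapter at positions 30 to 68, the 83 bases after the adapter form a PE pair with the second read and the 30 bases before it are kept as a single read; with the adapter at positions 100 to 138, the first 100 bases form an MP pair and the trailing 13 bases are kept as a single read.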
3 Supplementary Tables



Abbreviation | Bacteria | Accession ID | NCBI FTP
-------------|----------|--------------|---------
Bcer | Bacillus cereus ATCC 10987 | NC_003909, NC_005707 | ftp.ncbi.nih.gov/genomes/Bacteria/Bacillus_cereus_ATCC_10987_uid57673/
EcDH | Escherichia coli str. K-12 substr. DH10B | NC_010473 | ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__DH10B_uid58979/
EcMG | Escherichia coli str. K-12 substr. MG1655 | NC_000913 | ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/
list | Listeria monocytogenes | NC_003210 | ftp.ncbi.nih.gov/genomes/Bacteria/Listeria_monocytogenes_EGD_e_uid61583/
meio | Meiothermus ruber DSM 1279 | NC_013946 | ftp.ncbi.nih.gov/genomes/Bacteria/Meiothermus_ruber_DSM_1279_uid46661/
ped | Pedobacter heparinus DSM 2366 | NC_013061 | ftp.ncbi.nih.gov/genomes/Bacteria/Pedobacter_heparinus_DSM_2366_uid59111/
pneu | Klebsiella pneumoniae subsp. pneumoniae MGH 78578 | NC_009648, NC_009649, NC_009650, NC_009651, NC_009652, NC_009653 | ftp.ncbi.nih.gov/genomes/Bacteria/Klebsiella_pneumoniae_MGH_78578_uid57619/
rhod | Rhodobacter sphaeroides 2.4.1 | NC_007488, NC_007489, NC_007490, NC_007493, NC_007494, NC_009007, NC_009008 | ftp.ncbi.nih.gov/genomes/Bacteria/Rhodobacter_sphaeroides_2_4_1_uid57653/
TB | Mycobacterium tuberculosis H37Ra | NC_009525 | ftp.ncbi.nih.gov/genomes/Bacteria/Mycobacterium_tuberculosis_H37Ra_uid58853/

Table S1: Summary of bacteria analysed and the relevant NCBI information on their reference genomes.
There were two repeats of each strain. All 18 samples were prepared with the Nextera Mate Pair
protocol and sequenced in a single MiSeq run using 2x151bp reads. The untrimmed reads we used as
input to NxTrim (3.9Gbp in all) are available from BaseSpace via
https://basespace.illumina.com/s/TXv32Ve6wT19 (free registration required).
Sample | Reference size (kb) | Coverage depth | k-mer size | Reference recovered % | Contig NG50 (kb) | Contig NGA50 (kb) | #contigs | Scaffold NG50 (kb) | Scaffold NGA50 (kb) | #scaffolds | Assembly length (kb) | % Full genes assembled
-------|---------------------|----------------|------------|-----------------------|------------------|-------------------|----------|--------------------|---------------------|------------|----------------------|-----------------------
Bcer1 | 5432.65 | 23.98 | 43 | 98.76 | 88.27 | 88.27 | 125 | [illegible] | 1525.36 | 22 | 5375.68 | 96.82
Bcer2 | 5432.65 | 27.16 | 55 | 98.80 | 107.78 | 107.78 | 100 | 1656.36 | 650.25 | 20 | 5381.93 | 97.12
EcDH1 | 4686.14 | 48.19 | 65 | 96.97 | 224.96 | 224.96 | 59 | 4185.82 | 744.63 | 21 | 4553.43 | 96.37
EcDH2 | 4686.14 | 35.12 | 59 | 96.60 | 165.50 | 165.50 | 78 | 2351.56 | 615.09 | 27 | 4538.10 | 95.68
EcMG1 | 4641.65 | 35.44 | 61 | 99.10 | 204.59 | 202.96 | 66 | 4594.81 | 442.95 | 21 | 4605.78 | 98.34
EcMG2 | 4641.65 | 35.35 | 59 | 98.96 | 202.96 | 202.96 | 72 | 4589.59 | 420.34 | 19 | 4604.36 | 98.14
list1 | 2944.53 | 67.90 | 97 | 99.69 | 1656.30 | 1572.90 | 9 | 2927.87 | 2189.51 | 4 | 2933.10 | 99.29
list2 | 2944.53 | 53.34 | 81 | 99.59 | 2424.01 | 2424.01 | 6 | 2928.64 | 2424.01 | [missing] | 2934.80 | 99.59
meio1 | 3097.46 | 54.92 | 73 | 99.89 | 300.90 | 300.90 | 30 | 3000.26 | 876.48 | 13 | 3102.22 | 99.23
meio2 | 3097.46 | 47.80 | 75 | 99.89 | 192.67 | 192.67 | 37 | 3002.14 | 1283.90 | 9 | 3101.21 | 98.87
ped1 | 5167.38 | 35.95 | 49 | 99.65 | 459.30 | 456.76 | 48 | 5147.15 | 1271.26 | 16 | 5163.98 | 99.01
ped2 | 5167.38 | 27.29 | 55 | 99.62 | 290.96 | 290.96 | 62 | 5154.64 | 3209.92 | 13 | 5164.31 | 98.69
pneu1 | 5694.89 | 32.24 | 57 | 97.19 | 171.75 | 169.84 | 129 | 5290.29 | 546.70 | 33 | 5547.76 | 95.74
pneu2 | 5694.89 | 29.31 | 57 | 97.47 | 170.26 | 170.26 | 148 | 3642.50 | 499.86 | 44 | 5566.88 | 95.93
rhod1 | 4602.98 | 38.97 | 61 | 97.68 | 177.75 | 177.75 | 90 | 4126.89 | 2503.60 | 14 | 4499.77 | 96.11
rhod2 | 4602.98 | 45.05 | 69 | 97.89 | 272.18 | 272.18 | 74 | 3188.60 | 2503.68 | 16 | 4513.78 | 96.42
TB1 | 4419.98 | 46.86 | 69 | 97.83 | 98.39 | 75.44 | 100 | 4360.55 | 158.21 | 22 | 4359.53 | 96.45
TB2 | 4419.98 | 38.69 | 53 | 98.03 | 116.65 | 90.40 | 97 | 4361.85 | 154.83 | 23 | 4367.03 | 96.60
Average | | 40.20 | | 98.53 | 406.95 | 399.25 | 73 | 3795.32 | 1223.36 | 19 | 4461.87 | 97.47



Table S2: Assembly metrics for Velvet assemblies when using reads trimmed with NxTrim. 





Sample | Reference size (kb) | Coverage depth | k-mer size | Reference recovered % | Contig NG50 (kb) | Contig NGA50 (kb) | #contigs | Scaffold NG50 (kb) | Scaffold NGA50 (kb) | #scaffolds | Assembly length (kb) | % Full genes assembled
-------|---------------------|----------------|------------|-----------------------|------------------|-------------------|----------|--------------------|---------------------|------------|----------------------|-----------------------
Bcer1 | 5432.65 | 20.07 | 45 | 98.65 | 79.41 | 79.41 | 139 | 914.45 | 334.21 | 29 | 5375.66 | 96.24
Bcer2 | 5432.65 | 22.78 | 57 | 98.83 | 101.96 | 94.63 | 112 | 1404.36 | 1264.17 | 26 | 5384.74 | 97.14
EcDH1 | 4686.14 | 40.71 | 65 | 96.83 | 268.18 | 268.18 | 60 | 726.29 | 360.64 | 30 | 4551.74 | 96.55
EcDH2 | 4686.14 | 29.57 | 63 | 96.50 | 167.57 | 167.57 | 78 | 1446.85 | 307.80 | 26 | 4535.67 | 95.59
EcMG1 | 4641.65 | 28.71 | 63 | 99.03 | 204.31 | 173.96 | 70 | 3923.66 | 401.07 | 24 | 4609.90 | 98.32
EcMG2 | 4641.65 | 30.32 | 57 | 99.02 | 180.43 | 180.43 | 63 | 4120.26 | 332.01 | 17 | 4610.45 | 98.14
list1 | 2944.53 | 58.04 | 77 | 99.59 | 2003.18 | 1499.65 | 7 | 2923.32 | 1499.65 | 4 | 2928.05 | 99.26
list2 | 2944.53 | 45.41 | 83 | 99.23 | 1494.27 | 1494.27 | 10 | 1494.27 | 1494.27 | 6 | 2922.41 | 98.95
meio1 | 3097.46 | 45.98 | 69 | 99.81 | 418.56 | 418.56 | 30 | 2539.05 | 630.38 | 13 | 3098.56 | 99.10
meio2 | 3097.46 | 40.93 | 73 | 99.65 | 203.05 | 203.05 | 31 | 2998.37 | 1709.71 | 11 | 3104.66 | 98.65
ped1 | 5167.38 | 30.44 | 57 | 99.69 | 410.45 | 410.45 | 50 | 5147.21 | 1657.68 | 15 | 5168.05 | 98.92
ped2 | 5167.38 | 22.99 | 49 | 99.59 | 177.99 | 177.89 | 75 | 4927.39 | 885.71 | 14 | 5154.56 | 98.27
pneu1 | 5694.89 | 27.72 | 61 | 97.75 | 179.20 | 179.20 | 133 | 3933.88 | 538.59 | 42 | 5584.31 | 96.40
pneu2 | 5694.89 | 25.03 | 61 | 97.50 | 131.93 | 117.71 | 170 | 3709.20 | 694.13 | 46 | 5571.52 | 95.53
rhod1 | 4602.98 | 32.62 | 59 | 97.74 | 204.91 | 204.91 | 95 | 4127.42 | 2516.37 | 21 | 4509.48 | 96.18
rhod2 | 4602.98 | 37.98 | 69 | 97.40 | 280.51 | 280.51 | 75 | 3196.38 | 2934.26 | 15 | 4490.83 | 95.78
TB1 | 4419.98 | 39.27 | 69 | 97.88 | 98.39 | 72.58 | 100 | 2550.62 | 154.88 | 24 | 4361.38 | 96.57
TB2 | 4419.98 | 32.68 | 51 | 98.10 | 105.55 | 86.49 | 105 | 4367.59 | 154.70 | 24 | 4373.02 | 96.62
Average | | 33.96 | | 98.49 | 372.77 | 339.41 | 77 | 3025.03 | 992.79 | 21 | 4463.06 | 97.34



Table S3: Assembly metrics for Velvet assemblies when using reads trimmed with the standard MiSeq
Reporter trimming routine.
Sample | MP % | UNKNOWN % | MP+UNKNOWN % | PE % | SE %
-------|------|-----------|--------------|------|-----
Bcer1 | 53.67 | 10.06 | 63.73 | 36.27 | 42.67
Bcer2 | 49.13 | 13.71 | 62.84 | 37.16 | 40.08
EcDH1 | 43.95 | 19.48 | 63.43 | 36.57 | 37.65
EcDH2 | 45.58 | 17.46 | 63.03 | 36.97 | 38.02
EcMG1 | 45.68 | 17.97 | 63.65 | 36.35 | 38.71
EcMG2 | 40.89 | 25.06 | 65.96 | 34.04 | 35.43
list1 | 41.34 | 23.78 | 65.12 | 34.88 | 36.52
list2 | 46.21 | 19.11 | 65.32 | 34.68 | 38.72
meio1 | 49.08 | 13.44 | 62.52 | 37.48 | 40.14
meio2 | 40.91 | 24.61 | 65.51 | 34.49 | 35.24
ped1 | 45.00 | 18.98 | 63.98 | 36.02 | 38.50
ped2 | 48.99 | 15.01 | 64.00 | 36.00 | 39.95
pneu1 | 38.05 | 28.00 | 66.05 | 33.95 | 33.62
pneu2 | 40.43 | 24.55 | 64.98 | 35.02 | 35.39
rhod1 | 43.50 | 18.45 | 61.94 | 38.06 | 37.04
rhod2 | 40.71 | 22.10 | 62.81 | 37.19 | 35.21
TB1 | 43.45 | 16.90 | 60.35 | 39.65 | 36.78
TB2 | 43.77 | 18.16 | 61.93 | 38.07 | 37.53
Average | 44.24 | 19.42 | 63.66 | 36.34 | 37.51



Table S4: Breakdown of the proportions of different virtual libraries generated by our trimming method.
Note that the sum of MP, UNKNOWN and PE constitutes 100% of the read pairs that passed standard
chastity/purity filters. MP+UNKNOWN is the proportion of reads that will have large mate-pair insert
sizes (unknown pairs have a small amount of paired-end contamination). SE is the percentage of pairs where
a third "unpaired" single read was generated from an overhang of >21bp. Total is the average across all
samples weighted by coverage. So typically we see ≈62.57% of reads with mate-pair orientation (with
some contamination), ≈37.43% of reads with paired-end orientation, and ≈18.89% of either of these pairs
generate an orphaned overhanging single read.
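
The bookkeeping in Table S4 can be checked with a few lines of Python. The sketch below is illustrative: it copies three rows from Table S4, confirms that MP, UNKNOWN and PE sum to 100% up to rounding, and shows a coverage-weighted mean. The helper names are hypothetical, and pairing these rows with the coverage depths from Table S2 is an assumption about which coverage the caption refers to.

```python
def weighted_mean(values, weights):
    """Mean of `values` weighted by `weights` (e.g. per-sample coverage)."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def library_total(row):
    """MP % + UNKNOWN % + PE % for one sample."""
    mp, unknown, pe = row
    return mp + unknown + pe


# (MP %, UNKNOWN %, PE %) copied from three rows of Table S4.
ROWS = {
    "Bcer1": (53.67, 10.06, 36.27),
    "EcMG2": (40.89, 25.06, 34.04),
    "TB1": (43.45, 16.90, 39.65),
}
# Coverage depths for the same samples, taken from Table S2 (assumed weights).
COVERAGE = {"Bcer1": 23.98, "EcMG2": 35.35, "TB1": 46.86}

for sample, row in ROWS.items():
    # The three paired libraries account for every read pair, up to rounding.
    assert abs(library_total(row) - 100.0) < 0.02, sample

samples = list(ROWS)
mp_weighted = weighted_mean(
    [ROWS[s][0] for s in samples], [COVERAGE[s] for s in samples]
)
```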
4 Supplementary Figures



[Plot: number of genes (y-axis) against contig N50 in kb (x-axis, log10 scale).]

Figure S2: Plot of the number of genes against contig N50 (log10 scale, in kb) for different assemblies
across all samples. Assemblies were performed for all odd k-mers between 21 and 121, which generated
assemblies with varying contig N50s. We chose the assemblies with the highest contig N50, which appears
to be a reasonable criterion in this scenario, given that contig N50 is strongly correlated with the number
of genes found. Scaffold N50 was found to be a less reliable metric in this setting.